(12) 



UK Patent Application 



(19) 



GB 2 231 246 A

(43) Date of A publication 07.11.1990 



(21) Application No 9005142.6 

(22) Date of filing 07.03.1990



(30) Priority data 
(31) 01053899 



(32)08.03.1989 (33) JP 



(71) Applicant 

Kokusai Denshin Denwa Kabushiki Kaisha

(Incorporated in Japan)

2-3-2 Nishishinjuku, Shinjuku-ku, Tokyo-to, Japan

(72) Inventors 

Masahide Kaneko
Atsushi Koike
Yoshinori Hatori
Seiichi Yamamoto
Norio Higuchi

(74) Agent and/or Address for Service 
Elkington & Fife

Beacon House, 113 Kingsway, London, WC2B 6PP, 
United Kingdom 



(51) INT CL⁵

G06F 15/72 

(52) UK CL (Edition K) 

H4T TCJA T126 T128 

(56) Documents cited 

EP 0225729 A1 EP 0179701 A1 EP 0056507 A1 

(58) Field of search 

UK CL (Edition K) H4F FGH FGJ FGS, H4T TBAX 

TBEX TCGD TCGX TCHX TCJA TCXX 

INT CL⁵ G06F

Online databases: WPI 



(54) Converting text input into moving-face picture 

(57) A moving picture of a face with mouth-shape variations corresponding to a text sentence input is produced. The input
sentence is divided into a train of phonemes and a speech synthesis technique capable of outputting a voice feature of each
phoneme and its duration is utilized. Based on the voice feature, a mouth-shape feature corresponding to each phoneme is
determined (3). Based on the mouth-shape feature, the value of a mouth-shape parameter is determined (5, 4) for
representing a mouth shape. Further, the value of the mouth-shape parameter for each frame of the moving picture is
controlled (2) in accordance with the duration of each phoneme, thereby synthesizing the moving face picture having
mouth-shape variations which agree with the speech output.

PICTURE SYNTHESIZING METHOD AND APPARATUS

The present invention relates to a method for synthesizing
a picture through digital processing, and more
particularly, to a system for synthesizing a (still or
moving) picture of a face which represents changes in the
shape of a mouth accompanying the production of a speech
output.

When a man utters a vocal sound, vocal information is
produced by an articulator, and at the same time, his mouth
moves as he utters (i.e. changes in the shape of the mouth
in outward appearance). A method which converts a sentence
input as an input text to speech information and outputs it
is called speech synthesis, and this method has achieved a
fair success. In contrast hereto, few reports have been
published on a method for producing a picture of a face
which has mouth-shape variations in correspondence to an
input sentence, except the following report by Kiyotoshi
Matsuoka and Kenji Kurose.

The method proposed by Matsuoka and Kurose is disclosed
in a published paper [Kiyotoshi Matsuoka and Kenji Kurose:
"A moving picture program for a training in speech reading
for the deaf," Journal of the Institute of Electronic
Information and Communication Engineers of Japan, Vol. J70-D,
No. 11, pp. 2167-2171 (November 1987)].

Besides, there has also been reported, as a related
prior art, a method for presuming mouth-shape variations
corresponding to an input text. This method is disclosed in
a published paper [Shigeo Morishima, Kiyoharu Aizawa and
Hiroshi Harashima: "Studies of automatic synthesis of
expressions on the basis of speech information," 4TH
NICOGRAPH article contest, Collection of Articles,
pp. 139-146, Nihon Computer Graphics Association (November
1988)]. This article proposes a method which calculates the
logarithmic mean power of input speech information and
controls the opening of the mouth accordingly, and a method
which calculates a linear prediction coefficient
corresponding to the formant characteristic of the vocal
tract and presumes the mouth shape.

The method by Matsuoka and Kurose has been described
above as a conventional method for producing pictures of a
face which have mouth-shape variations corresponding to a
sentence (an input text) being input, but this method poses
the following problems. Although a vocal sound and the
mouth shape are closely related to each other in utterance,
the method basically syllabicates the sentence and selects
mouth-shape patterns on the basis of the correspondence in
terms of characters, and consequently, the correlation
between the speech generating mechanism and the mouth-shape
generation is insufficient. This introduces difficulty in
producing the mouth shape correctly in correspondence to the
speech output. Further, although a phoneme (a minimum unit
in utterance, a syllable being composed of a plurality of
phonemes) differs in duration in accordance with the
connection between it and the preceding and following
phonemes, the method by Matsuoka and Kurose fixedly assigns
four frames to each syllable, and consequently, it is
difficult to represent natural mouth-shape variations in
correspondence to the input sentence. Moreover, in the case
of outputting the sound and the mouth-shape picture in
response to the sentence being input, it is difficult to
match them with each other.

The method proposed by Morishima, Aizawa and Harashima
is to presume the mouth shape on the basis of input speech
information, and hence cannot be applied to the production
of a moving picture which has mouth-shape variations
corresponding to the input sentence.

In view of the above, an object of the present invention
is to provide a picture synthesizing method and apparatus
which permit the representation of mouth-shape variations
which correspond accurately to speech outputs and agree with
the durations of phonemes.

According to an aspect of the present invention, the
picture synthesizing method for generating a moving face
picture with mouth-shape variations corresponding to a
sentence input divides the sentence input into a train of
phonemes and utilizes a speech synthesis technique capable
of outputting a voice feature of each phoneme and its
duration. Based on the voice feature, a mouth-shape feature
corresponding to each phoneme is determined. Based on the
mouth-shape feature, the value of a mouth-shape parameter is
determined for representing a concrete mouth shape. Further,
the value of the mouth-shape parameter for each frame of the
moving picture is controlled in accordance with the duration
of each phoneme, thereby synthesizing the moving face
picture having mouth-shape variations which agree with the
speech output.

According to another aspect of the present invention,
the picture synthesizing apparatus comprises: an input
terminal for receiving a sentence input; a speech
synthesizer which divides the input sentence from the input
terminal into a train of phonemes and outputs a voice
feature for each phoneme and its duration; a converter which
converts the voice feature for each phoneme into a
mouth-shape feature; a conversion table which establishes
correspondence between various mouth-shape features and
mouth-shape parameters representing concrete mouth shapes;
a unit which obtains from the conversion table a mouth-shape
parameter corresponding to the mouth-shape feature for each
phoneme; a time adjuster wherein the value of the
mouth-shape parameter output from the unit is controlled in
accordance with the duration for each phoneme from the
speech synthesizer so as to generate a moving picture
provided as a train of pictures spaced apart for a fixed
period of time; and a picture generator which generates a
picture in accordance with the mouth-shape parameter output
from said unit under control of the time adjuster.

According to still another aspect of the present
invention, the moving picture synthesizing apparatus
comprises: an input terminal for receiving a sentence input;
a speech synthesizer which divides the input sentence from
the input terminal into a train of phonemes and outputs a
voice feature for each phoneme and its duration; a converter
which converts the voice feature for each phoneme into a
mouth-shape feature; a conversion table which establishes
correspondence between various mouth-shape features and
mouth-shape parameters representing concrete mouth shapes;
a unit which obtains from the conversion table a mouth-shape
parameter corresponding to the mouth-shape feature for each
phoneme; a time adjuster wherein the value of the
mouth-shape parameter output from the unit is controlled in
accordance with the duration for each phoneme from the
speech synthesizer so as to generate a moving picture
provided as a train of pictures spaced apart for a fixed
period of time; a picture generator which generates a
picture in accordance with the mouth-shape parameter output
from the unit under control of the time adjuster;
a transition detector for detecting a transition from a
certain phoneme to the next in accordance with the output of
the time adjuster; a memory capable of storing, for at least
one frame period, the values of the mouth-shape
parameters used in the picture generator; and a mouth-shape
parameter modifier for obtaining an intermediate value
between the value of the mouth-shape parameter stored in the
memory and the value of the mouth-shape parameter provided
from the unit. During the transition from a certain phoneme
to the next an intermediate mouth shape is generated,
producing a moving face picture with smooth mouth-shape
variations.






The present invention will be described in detail below
in comparison with prior art with reference to the
accompanying drawings, in which:

Fig. 1 is a block diagram corresponding to an
embodiment of the present invention;

Figs. 2A and 2B are diagrams showing examples of
parameters for representing a mouth shape;

Fig. 3 is a block diagram corresponding to an example
of the operation of a time adjuster employed in the present
invention;

Fig. 4 is a block diagram corresponding to another
embodiment of the present invention;

Fig. 5 is a block diagram corresponding to an example
of the operation of a transition detector employed in the
embodiment shown in Fig. 4; and

Fig. 6 is a block diagram corresponding to the
operation of a conventional picture synthesizing system.

To make differences between prior art and the present
invention clear, an example of prior art will first be
described.

The method of the first-mentioned paper is executed in
the form of a program, and the basic concept of obtaining
mouth-shape variations corresponding to the input sentence
is shown in Fig. 6.

In Fig. 6 reference numeral 50 indicates a syllable
separator, 51 a unit making correspondence between syllables
and mouth-shape patterns, 52 a table containing
correspondence between syllables and mouth-shape patterns,
53 a mouth-shape selector, and 54 a memory for mouth-shape.
Next, the operations of these units will be described in
brief.

The syllable separator 50 divides an input sentence (an
input text) into syllables. For instance, an input "kuma" in
Japanese is divided into syllables "ku" and "ma". The table
52 is one that prestores the correspondence between prepared
syllables and mouth-shape patterns. The syllables each
represent a group of sounds "a", "ka", etc. The mouth-shape
patterns include big ones (<A> <I> <U> <E> <K>, etc.) and
small ones (<u> <o> <k> <s>, etc.) and indicate the kinds of
the mouth shapes. They are used to prestore as a table the
correspondence between the syllables and the mouth-shape
patterns in such forms as <A><*><A> for "a" and <K><*><A>
for "ka", for example. In this case, the symbol <*>
indicates an intermediate mouth shape. The unit 51 reads
out, for each syllable from the syllable separator 50, the
corresponding mouth-shape pattern from the table 52. The
memory for mouth-shape 54 is one that prestores, for each of
the above-mentioned mouth-shape patterns, a concrete mouth
shape as a graphic form or shape parameter. The mouth-shape
selector 53, when it receives mouth-shape patterns from the
unit 51, sequentially refers to the contents of the memory
for mouth-shape 54 to select and output concrete mouth
shapes as output pictures. At this time, intermediate mouth
shapes (intermediate between the preceding and following
mouth shapes) are also produced. For providing the output as
a moving picture, the mouth shape for each syllable is
fixedly assigned four frames.

In the following, the present invention will be
described.

Fig. 1 is a block diagram for explaining an embodiment
of the present invention. Now, assume that input
information is an input text (a sentence) obtainable from a
keyboard or file unit such as a magnetic disk. In Fig. 1
reference numeral 1 indicates a speech synthesizer, 2 a time
adjuster, 3 a speech feature to mouth-shape feature
converter, 4 a conversion table of mouth-shape features to
mouth-shape parameters, 5 a unit obtaining mouth-shape
parameters, 6 a picture generator, 10 a gate, 900 an input
text (sentence) terminal, and 901 an output picture terminal.

Next, the operation of each unit will be described.
The speech synthesizer 1 synthesizes a speech output
corresponding to an input sentence. Various systems have
been proposed for speech synthesis, but it is postulated
here to utilize an existing speech rule synthesizing method
which employs a Klatt type formant speech synthesizer as a
vocal tract model, because it is excellent in matching with
the mouth-shape generation. This method is described in
detail in a published paper [Seiichi Yamamoto, Norio Higuchi
and Tohru Shimizu: "Trial Manufacture of a Speech Rule
Synthesizer with Text-Editing Function," Institute of
Electronic Information and Communication Engineers of Japan,
Technical Report SP87-137 (March 1988)]. No detailed
description will be given of the speech synthesizer, because
it is a known technique and is not the applied object of the
present invention. The speech synthesizer needs only to
output information of a vocal sound feature and a duration
for each phoneme so as to establish accurate correspondence
between generated voice and mouth shapes. According to the
method by Yamamoto, Higuchi and Shimizu, the speech
synthesizer is adapted to output vocal sound features such
as an articulation mode, an articulation point, a
distinction between voiced and voiceless sound and pitch
control information and information of a duration based
thereon, and fulfils the requirement. Other speech
synthesizing methods can be employed, as long as they
provide such information.

Moreover, if the information of a vocal sound feature
and a duration for each phoneme is obtained, the present
invention can be applied to an input text of English, French,
German, etc. as well as Japanese.
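
The requirement placed on the speech synthesizer 1 can be
pictured as a small interface: any source that yields, for
each phoneme, a vocal sound feature and a duration would
serve. The following Python sketch is illustrative only. The
names PhonemeInfo, PhonemeSource and FixedTableSource, the
feature labels and the durations are all assumptions made
for the example and do not come from the specification;
pitch control information is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class PhonemeInfo:
    """Per-phoneme output expected from the speech synthesizer (hypothetical layout)."""
    symbol: str
    articulation_mode: str    # e.g. "plosive", "vowel" (assumed labels)
    articulation_point: str   # e.g. "velar", "back" (assumed labels)
    voiced: bool              # distinction between voiced and voiceless sound
    duration: float           # seconds


class PhonemeSource(Protocol):
    """Anything that turns text into per-phoneme features and durations."""
    def phonemes(self, text: str) -> Iterable[PhonemeInfo]: ...


class FixedTableSource:
    """Toy stand-in for a rule synthesizer: one table entry per letter."""
    # All values below are invented for illustration.
    TABLE = {
        "k": ("plosive", "velar", False, 0.06),
        "o": ("vowel", "back", True, 0.12),
    }

    def phonemes(self, text: str) -> List[PhonemeInfo]:
        return [PhonemeInfo(ch, *self.TABLE[ch]) for ch in text if ch in self.TABLE]


def total_duration(source: PhonemeSource, text: str) -> float:
    """Sum of phoneme durations, i.e. the length of the synthesized utterance."""
    return sum(p.duration for p in source.phonemes(text))
```

Because the rest of the apparatus depends only on this
per-phoneme information, a synthesizer for another language
could be substituted without changing the later stages.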

The time adjuster 2 is provided to control the input of
a mouth-shape parameter into the picture generator 6 on the
basis of the duration of each phoneme (the duration of an
i-th phoneme being represented by $t_i$) which is provided
from the speech synthesizer 1. That is, when a picture (a
moving picture, in particular) is output as a television
signal of 30 frames per second by the NTSC television
system, for example, it is necessary that the picture be
generated as information for each 1/30 second. The operation
of the time adjuster 2 will be described in detail later on.

The converter 3 converts the vocal sound feature from
the speech synthesizer 1 to a mouth-shape feature
corresponding to the phoneme concerned. The mouth-shape
features are, for example, (1) the degree of opening of the
mouth (appreciably open - completely shut), (2) the degree
of roundness of lips (round - drawn to both sides), (3) the
height of the lower jaw (raised - lowered), and (4) the
degree to which the tongue is seen. Based on an observation
of how a man actually utters each phoneme, the
correspondence between the vocal sound feature and the
mouth-shape feature is formulated.

For example, in the case of a Japanese sentence
"konnichiwa" being input, vocal sound features are converted
to mouth-shape features as follows:

    (voiceless sound)    lv0   lh4   jaw4
    k                    lv2   lhx   jaw2   tbck
    o                    lv2   lh1   jaw2

In the above lv, lh and jaw represent the degree of opening
of the mouth, the degree of roundness of lips, and the
height of the lower jaw, respectively, the numerals
represent their values, x indicates that their degree is
determined by preceding and succeeding phonemes, and tbck
represents the degree to which the tongue is seen. (In this
case, it is indicated that the tongue is slightly seen at
the back of the mouth.)
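
As an illustration of the conversion performed by the
converter 3, the following Python sketch stores the
mouth-shape features listed above for "k" and "o" and fills
in a value marked x from the neighbouring phonemes. The
specification says only that x is determined by the
preceding and succeeding phonemes; the particular rule used
here (take the nearest defined value, looking forward first
and then backward) is an assumption, as are the names
MouthFeature, FEATURES and resolve.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class MouthFeature:
    lv: Optional[int]    # degree of opening of the mouth; None stands for "x"
    lh: Optional[int]    # degree of roundness of lips; None stands for "x"
    jaw: Optional[int]   # height of the lower jaw; None stands for "x"
    tongue: str = ""     # e.g. "tbck": tongue slightly seen at the back


# Entries taken from the "konnichiwa" example in the text.
FEATURES = {
    "k": MouthFeature(lv=2, lh=None, jaw=2, tongue="tbck"),
    "o": MouthFeature(lv=2, lh=1, jaw=2),
}


def _neighbour(seq: List[MouthFeature], i: int, name: str) -> int:
    """Assumed rule: nearest defined value, succeeding phonemes first."""
    for j in list(range(i + 1, len(seq))) + list(range(i - 1, -1, -1)):
        value = getattr(seq[j], name)
        if value is not None:
            return value
    return 0  # assumed neutral default when no neighbour defines the value


def resolve(seq: List[MouthFeature]) -> List[MouthFeature]:
    """Replace every 'x' (None) by a value borrowed from a neighbouring phoneme."""
    out = []
    for i, feature in enumerate(seq):
        missing = {
            name: _neighbour(seq, i, name)
            for name in ("lv", "lh", "jaw")
            if getattr(feature, name) is None
        }
        out.append(replace(feature, **missing))
    return out
```

With this rule the lip roundness of "k" in "ko" is taken
from the following "o", so the lips are already shaped for
the vowel while the consonant is uttered.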

The conversion table 4 for converting the mouth-shape
feature to the corresponding mouth-shape parameter is a
table which provides the parameter values for representing a
concrete mouth shape for each of the afore-mentioned
mouth-shape features. Examples of parameters for
representing mouth shapes are shown in Figs. 2A and 2B.
Fig. 2A is a front view of the mouth portion. The mouth
shape is defined by the positions of eight points $P_1$
through $P_8$, the degree to which upper and lower teeth are
seen is defined by the positions of points $Q_1$ and $Q_2$,
and the thicknesses of upper and lower lips are defined by
values $h_1$ and $h_2$. Fig. 2B is a side view of the mouth
portion, and inversions of the upper and lower lips are
defined by angles $\theta_1$ and $\theta_2$. These
parameters are adopted for representing natural mouth
shapes. However, more kinds of parameters can be utilized.
Mouth shapes may also be represented by parameters and
indications other than those of Figs. 2A and 2B. In the
conversion table 4 there are prestored, in the form of a
table, sets of values of the above-mentioned parameters
$P_1$ to $P_8$, $Q_1$, $Q_2$, $h_1$, $h_2$, $\theta_1$ and
$\theta_2$ predetermined on the basis of the results of
measurements of the mouth shapes of a man when he
actually utters vocal sounds.

In response to the mouth-shape feature corresponding to 
the phoneme concerned, provided from the speech feature to 
mouth-shape feature converter 3, the unit 5 refers to the 
conversion table 4 to read out therefrom a set of values of 
mouth-shape parameters for the phoneme. 
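
The look-up carried out by the unit 5 can be sketched as a
dictionary keyed by the mouth-shape feature. In the Python
example below the parameter layout follows Figs. 2A and 2B
(eight contour points, two teeth points, two lip thicknesses
and two lip angles), but every numeric value is invented for
illustration; the specification states that real entries are
obtained by measuring a speaker's mouth. The names
MouthParams, CONVERSION_TABLE, ellipse_contour and
obtain_parameters are assumptions.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]
FeatureKey = Tuple[int, int, int]  # (lv, lh, jaw) after any "x" has been resolved


@dataclass(frozen=True)
class MouthParams:
    contour: Tuple[Point, ...]           # P1..P8, front view
    teeth: Tuple[Point, Point]           # Q1, Q2
    lip_thickness: Tuple[float, float]   # h1, h2
    lip_angles: Tuple[float, float]      # theta1, theta2 (side view), degrees


def ellipse_contour(width: float, height: float) -> Tuple[Point, ...]:
    """Eight points on an ellipse, used here only to fabricate plausible P1..P8."""
    return tuple(
        (round(width / 2 * math.cos(k * math.pi / 4), 3),
         round(height / 2 * math.sin(k * math.pi / 4), 3))
        for k in range(8)
    )


# Placeholder table: two entries with made-up values.
CONVERSION_TABLE: Dict[FeatureKey, MouthParams] = {
    (0, 4, 4): MouthParams(ellipse_contour(50.0, 2.0),
                           ((0.0, 0.0), (0.0, 0.0)), (8.0, 10.0), (0.0, 0.0)),
    (2, 1, 2): MouthParams(ellipse_contour(30.0, 16.0),
                           ((0.0, 5.0), (0.0, -5.0)), (9.0, 11.0), (10.0, 12.0)),
}


def obtain_parameters(feature: FeatureKey) -> MouthParams:
    """Read the parameter set for one mouth-shape feature out of the table."""
    try:
        return CONVERSION_TABLE[feature]
    except KeyError:
        raise KeyError(f"no table entry for mouth-shape feature {feature}") from None
```

Keeping the measured values in a table separates the
phonetic rules of the converter 3 from the geometry of a
particular face, so either can be changed independently.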

The gate 10 is provided for controlling whether or not
the above-mentioned mouth-shape parameters for the phoneme
are sent to the picture generator 6, and it sends the
mouth-shape parameters to the picture generator 6 the
number of times specified by the time adjuster 2 (a value
obtained by multiplying the above-mentioned number of times
by 1/30 second being the time for displaying the mouth shape
for the phoneme).

The picture generator 6 generates a picture of the
mouth based on the mouth-shape parameters sent for each 1/30
second from the unit 5 via the gate 10. A picture including
the whole face in addition to the mouth portion is generated
as required. The details of the generation of a picture of
a mouth or face based on mouth-shape parameters are described
in, for example, a published paper [Masahide Kaneko,
Yoshinori Hatori and Kiyoshi Koike, "Detection of Shape
Variations and Coding of a Moving Face Picture Based on a
Three-Dimensional Model," Journal of the Institute of
Electronic Information and Communication Engineers of Japan,
B, vol. J71-B, No. 12, pp. 1554-1563 (December 1988)]. In
rough terms, a three-dimensional wire frame model is at
first prepared which represents the three-dimensional
configuration of the head of a person, and mouth portions
(lips, teeth, jaws, etc., in concrete terms) of the
three-dimensional wire frame model are modified in
accordance with mouth-shape parameters provided. By
providing to the modified model information specifying the
shading and color of each part of the model for each picture
element, it is possible to obtain a real picture of the
mouth or face.

Now, the operation of the time adjuster 2 will
be described in detail. Fig. 3 is a block diagram
explanatory of the structure and operation of the time
adjuster 2. In Fig. 3 reference numeral 21 indicates a
delay, 22 a comparator, 23 and 24 memories, 25 and 26
adders, 27 a switch, 28 and 29 branches, 30 a time
normalizer, 201 and 202 output lines of the comparator 22,
902 an initial reset signal terminal, 903 a constant (1/30)
input terminal, and 920 and 921 terminals of the switch 27.
Next, the operation of each of these parts will be
described. The memory 23 is provided for storing a total
duration, $\sum_{i=1}^{I} t_i$, to an I-th phoneme. Prior to
the start of picture synthesis, a zero is set in the memory
23 by an initial reset signal from the terminal 902. When
the duration $t_I$ of the I-th phoneme is provided from the
speech synthesizer 1, the total duration
$\sum_{i=1}^{I-1} t_i$ to an (I-1)th phoneme stored in the
memory 23 and the duration $t_I$ of the I-th phoneme are
added by the adder 25 to obtain the sum
$\sum_{i=1}^{I} t_i$, and the delay 21 serves to store the
total duration $\sum_{i=1}^{I-1} t_i$ to the (I-1)th phoneme
until processing for the (I+1)th phoneme is initiated. In
response to the output $\sum_{i=1}^{I-1} t_i$ of the delay
21, the time normalizer 30 obtains an N which satisfies

$(1/30) \times N \le \sum_{i=1}^{I-1} t_i < (1/30) \times (N+1)$,

and outputs a value $(1/30) \times N$, where N is an integer
and 1/30 is a constant which provides a one-frame period of
1/30 second. The switch 27 is connected to the terminal 920
by the output 202 from the comparator 22 when processing for
the I-th phoneme is started. At this time, the sum t of the
output $(1/30) \times N$ of the time normalizer 30 and the
constant 1/30 is calculated by the adder 26. The comparator
22 compares the value t and the value $\sum_{i=1}^{I} t_i$,
and provides a signal on the output line 201 or 202
depending on whether $t \le \sum_{i=1}^{I} t_i$ or
$t > \sum_{i=1}^{I} t_i$. The latter case means the
expiration of the duration of the I-th phoneme, issuing
through the output line 202 an instruction to the speech
synthesizer 1 to output information of the (I+1)th
phoneme, an instruction to the memory 24 to reset its
contents, an instruction to the switch 27 to connect the
same to the terminal 920, and an instruction to the delay 21
to output the value of the delayed duration
$\sum_{i=1}^{I} t_i$. The memory 24 is provided for
temporarily storing the output of the adder 26. The switch
27 is connected to the terminal 921 while
$t \le \sum_{i=1}^{I} t_i$ holds, during which the adder 26
renews the preceding sum t by adding thereto the constant
1/30 for each frame. In this way, while
$t \le \sum_{i=1}^{I} t_i$ holds, the comparator 22 provides
the signal on the output line 201 to enable the gate 10 in
Fig. 1, through which mouth-shape parameters corresponding
to the I-th phoneme are supplied to the picture generator 6
for the duration of the I-th phoneme.
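
The behaviour of the time adjuster 2 can be checked with a
short simulation. The Python sketch below is a software
analogue of Fig. 3 written under stated assumptions, not a
description of the circuit itself: the function name
frames_per_phoneme is hypothetical, exact fractions are used
instead of analogue or floating-point values so that the
comparisons are unambiguous, and the comparator is assumed
to pass a frame when t is less than or equal to the running
total.

```python
from fractions import Fraction
from typing import Iterable, List


def frames_per_phoneme(durations: Iterable[Fraction],
                       frame_period: Fraction = Fraction(1, 30)) -> List[int]:
    """Number of frames during which the gate 10 stays open for each phoneme."""
    counts = []
    total_prev = Fraction(0)                   # memory 23 after the initial reset
    for d in durations:
        total = total_prev + Fraction(d)       # adder 25: total up to this phoneme
        n = total_prev // frame_period         # time normalizer 30: N/30 <= previous total
        t = n * frame_period + frame_period    # adder 26, first pass (switch at 920)
        frames = 0
        while t <= total:                      # comparator 22 signals on line 201
            frames += 1                        # gate 10 passes one frame
            t += frame_period                  # adder 26 adds 1/30 (switch at 921)
        counts.append(frames)                  # t > total: line 202, next phoneme
        total_prev = total                     # value held by the delay 21
    return counts


if __name__ == "__main__":
    # Three phonemes lasting 1/10 s, 1/20 s and 1/12 s.
    print(frames_per_phoneme([Fraction(1, 10), Fraction(1, 20), Fraction(1, 12)]))
    # prints [3, 1, 3]
```

Because frame times are always measured against the running
total rather than against each phoneme separately, rounding
errors do not accumulate: in the example the three phonemes
last 7/30 second in all and receive seven frames in all. A
phoneme shorter than one frame period may receive no frame.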
The above is the first embodiment of the present
invention. In the first embodiment, when the I-th phoneme
changes to the (I+1)th phoneme, the mouth-shape parameters
of the former discontinuously change to the mouth-shape
parameters of the latter. In this instance, if the
mouth-shape parameters of the two phonemes do not differ
widely from each other, the synthesized moving picture will
not be so unnatural. When a person utters vocal sounds,
however, his mouth shape changes continuously; therefore,
when the I-th phoneme changes to the (I+1)th phoneme, it is
desirable that the mouth shape of the moving picture changes
continuously.



Fig. 4 is a block diagram explanatory of another
embodiment of the present invention designed to meet
the above requirement. In Fig. 4 reference numeral 7
indicates a mouth-shape parameter modifier, 8 a transition
detector, 9 a memory, 40 a switch, and 910 and 911 terminals
of the switch 40. This embodiment is identical in
construction with the Fig. 1 embodiment except for the
above. Now, a description will be given of the operations of
the newly added units.

The transition detector 8 is to detect the transition
from a certain phoneme (the I-th phoneme, for example) to
the next one (the (I+1)th phoneme). Fig. 5 is a block
diagram explanatory of the operation of the transition
detector 8 according to the present invention. Reference
numeral 81 indicates a counter, 82 a decision circuit, and
210 and 211 output lines. The counter 81 is reset to zero
when the comparator 22 provides a signal on the output line
202, and the counter 81 is incremented by one whenever the
comparator 22 provides a signal on the output line 201. The
decision circuit 82 determines whether the output of the
counter 81 is a state "1" or not and, when it is the state
"1", provides a signal on the output line 210, because the
state "1" indicates the occurrence of transition from a
certain phoneme to the next. When the counter output is a
state "2" or more, this means that the current phoneme still
lasts, and the decision circuit 82 provides a signal on the
output line 211.
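
The counter 81 and decision circuit 82 can be modelled in a
few lines, together with the way the Fig. 4 embodiment uses
their output to choose between the mouth-shape parameter
modifier 7 and the unmodified parameters. The Python sketch
below is illustrative only. The name render_sequence, the
use of a plain midpoint as the intermediate value, and the
treatment of the very first frame (nothing is yet stored in
the memory 9, so the parameters are passed through
unchanged) are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Params = Tuple[float, ...]   # one set of mouth-shape parameter values


def midpoint(a: Params, b: Params) -> Params:
    """Intermediate value as produced by the modifier 7 (assumed: simple average)."""
    return tuple((x + y) / 2 for x, y in zip(a, b))


def render_sequence(phonemes: Sequence[Tuple[Params, int]]) -> List[Params]:
    """Parameters sent to the picture generator 6, frame by frame.

    Each item of `phonemes` is (parameters from unit 5, number of frames
    granted by the time adjuster 2).
    """
    frames: List[Params] = []
    memory: Optional[Params] = None      # memory 9: parameters of the preceding frame
    for params, n_frames in phonemes:
        counter = 0                      # counter 81 reset by output line 202
        for _ in range(n_frames):
            counter += 1                 # incremented by output line 201
            if counter == 1 and memory is not None:
                out = midpoint(memory, params)   # line 210: switch 40 at terminal 910
            else:
                out = params                     # line 211: switch 40 at terminal 911
            frames.append(out)
            memory = out
    return frames


if __name__ == "__main__":
    closed, open_ = (0.0, 0.0), (4.0, 2.0)
    print(render_sequence([(closed, 2), (open_, 3)]))
    # prints [(0.0, 0.0), (0.0, 0.0), (2.0, 1.0), (4.0, 2.0), (4.0, 2.0)]
```

Only the first frame of each new phoneme is blended, so the
change of mouth shape is spread over two frame periods
instead of occurring in a single step.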

The memory 9 is provided for storing, for at least one
frame period, the mouth-shape parameters used for
synthesizing a picture of the preceding frame. The
mouth-shape parameter modifier 7 obtains, for instance,
intermediate values between the mouth-shape parameters of
the preceding frame stored in the memory 9 and the
mouth-shape parameters for the current phoneme which are
provided from the unit 5, and provides such intermediate
values as mouth-shape parameters for synthesizing a picture
of the current frame. The switch 40 is connected to the
terminal 910 or 911, depending on whether the transition
detector 8 provides a signal on the output line 210 or 211.
Consequently, the intermediate values between the
mouth-shape parameters for two phonemes, available from the
mouth-shape parameter modifier 7, or the mouth-shape
parameters for the current phoneme are supplied to the
picture generator 6, depending on whether the switch 40 is
connected to the terminal 910 or 911. While in the above the
intermediate values between the mouth-shape parameters of a
certain phoneme and the next are produced for only one
frame, it is also possible to implement smoother mouth-shape
variations by producing such intermediate values at more
steps in accordance with the counting state of the counter
81, for instance.

As described above, the present invention is directed
to a system for synthesizing a moving picture of a person's
face which has mouth-shape variations corresponding to a
sentence input. However, if it is possible to utilize a
speech recognition method by which, even if speech
information is input, it can be divided into a train of
phonemes and a voice feature for each phoneme and its
duration can be output, then a moving picture with
mouth-shape variations corresponding to the input speech
information can also be synthesized by replacing the speech
synthesizer 1 in the present invention by a speech detector
which performs such operations as mentioned above.

As described above, the present invention permits the
synthesis of a moving picture which has an accurate
correspondence between a sentence input and a speech output
and mouth-shape variations corresponding to the duration of
each phoneme, and consequently natural mouth-shape
variations well matched with the speech output.

The prior art can only synthesize a speech output but
the present invention allows ease in producing not only such
a speech output but also a moving picture having natural
mouth-shape variations well matched with the speech output.
Accordingly, the present invention is applicable to the
production of a moving picture without the necessity of
actual film shooting (the production of a television program
or movie, for example), an automatic response unit and a
man-machine interface utilizing a speech and a picture, and
the conversion of medium from a sentence to a speech and a
moving picture. Hence, the present invention is of great
practical utility.
CLAIMS

1. A picture synthesizing method for synthesizing a moving
picture of a person's face which has mouth-shape variations
corresponding to a sentence input,
characterized by the steps of:

dividing the sentence input into a train of phonemes;

utilizing a speech synthesis technique capable of
outputting a voice feature of each phoneme of the train of
phonemes and its duration;

determining a mouth-shape feature corresponding to each
phoneme on the basis of the voice feature;

determining the value of a mouth-shape parameter for
representing a concrete mouth shape on the basis of the
mouth-shape feature; and

controlling the value of the mouth-shape parameter for
each phoneme for each frame of the moving picture in
accordance with the duration of each phoneme, thereby
synthesizing the moving picture having mouth-shape
variations matched with a speech output.

2. A picture synthesizing apparatus comprising: 

an input terminal for receiving a sentence input; 
a speech synthesizer capable of dividing the input 
sentence into a train of phonemes and outputting a voice 
feature of each phoneme and its duration; 

a converter for converting the voice feature for each 
phoneme into a mouth-shape feature; 

a conversion table having established correspondence
between various mouth-shape features and mouth-shape
parameters for representing concrete mouth shapes;



means for obtaining from the conversion table a mouth- 
shape parameter corresponding to the mouth-shape feature for 
each phoneme provided in the converting section; 

a time adjuster whereby the value of the mouth-shape 
parameter output from said means for obtaining is controlled 
in accordance with the duration of each phoneme from the 
speech synthesizer so as to produce a moving picture as a 
train of pictures spaced apart for a fixed period of time;
and 

a picture generator for generating the picture in 
accordance with the values of the mouth-shape parameters 
from said means for obtaining mouth-shape parameters under 
control of the time adjuster. 

3. A picture synthesizing apparatus according to claim 2, 
characterized by a transition detector for detecting a 
transition from a certain phoneme to the next in accordance 
with the output of the time adjuster, a memory capable of 
storing for at least one frame period the values of the 
mouth-shape parameters used in the picture generator, and a 
mouth-shape parameter modifier for obtaining an intermediate 
value between the value of the mouth-shape parameter stored 
in the memory and the value of the mouth-shape parameter 
provided from said means for obtaining the mouth-shape
parameters, whereby during the transition from the certain
phoneme to the next an intermediate mouth shape is generated,
producing the moving picture of a person's face with smooth
mouth-shape variations. 



4. A picture synthesizing method for synthesizing a
moving picture of a person's face which has mouth-shape
variations corresponding to a sentence input substantially
as herein described with reference to Figure 1 with or
without reference to any of Figures 2 to 5 of the
accompanying drawings.



5. A picture synthesizing apparatus substantially as
herein described with reference to Figure 1 with or without
reference to any of Figures 2 to 5 of the accompanying
drawings.